{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "“3_测试LangChain_Notion_RAG_FAISS.ipynb” 中，用自定义的loader，通过notion api递归式地读取指定页面。\n",
    "https://github.com/unnikked/NotionRag\n",
    "在这个项目中发现langchain提供了一个读取notion数据库的loader。\n",
    "可以把需要进行RAG的page，全部放到一个指定的notion数据库中，直接赋予集成读取该数据库的权限，而不必每个page手动赋予集成权限。\n",
    "\n",
    "前置工作：\n",
    "1. 集成令牌（Integration Token）：需在 Notion 中创建集成（Integration）并获取 API 密钥。\n",
    "2. 数据库 ID：目标 Notion 数据库的唯一标识符（URL 中的哈希值）。\n",
    "3. 确保集成的权限设置允许访问目标数据库（在 Notion 中需邀请集成到数据库页面）。\n",
    "4. 后续如果在数据库中新增页面，此处即可读取。也可以将其他页面放入该数据库：创建副本 - 移动到该数据库，不支持仅插入该页面的url。\n",
    "\n",
    "pipeline：\n",
    "1. 指定一个notion database id（包含所有希望rag的page），通过Loader加载；\n",
    "2. 分割成n个片段，每个大小为500个token，两两之间重叠部分为50（10%）个token；\n",
    "3. 通过开源的BAAI/bge-m3或bge-large-zh-v1.5，对片段进行向量化，并通过Chroma向量数据库持久化的存储在本地；\n",
    "4. 设计RAG prompt和RAG chain；\n",
    "5. 测试GPT在有无RAG情况下的表现。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "测试NotionLoader：  \n",
    "创建一个Notion Database，和一个测试用的page  \n",
    "  <div style=\"display: flex; gap: 20px; align-items: center;\">\n",
    "    <img src=\"./assets/Notion4RAG.webp\" alt=\"Notion Database\" style=\"width: 30%;\">\n",
    "    <img src=\"./assets/截图 2025-03-14 16-36-46.png\" alt=\"Notion Database\" style=\"width: 15%;\">\n",
    "  </div>\n",
    "可见NotionLoader能正确获取各种block类型中的文本。\n",
    "\n",
    "注意在database中的page如果是url，如图中的“conda/pip”，则无法递归地获取文本。可以将“conda/pip”，创建副本 - 移动到database。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from dotenv import load_dotenv\n",
    "load_dotenv() # 在当前路径下新建一个.env文档，其中存储了代理信息\n",
    "\n",
    "NOTION_API_KEY = os.getenv(\"NOTION_API_KEY\")\n",
    "NOTION_DATABASE_ID = os.getenv(\"NOTION_DATABASE_ID\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders import NotionDBLoader\n",
    "\n",
    "loader = NotionDBLoader(\n",
    "    integration_token=NOTION_API_KEY,\n",
    "    database_id=NOTION_DATABASE_ID,\n",
    "    request_timeout_sec=30  # 可选，API 超时时间\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "doc-1: \n",
      "metadata:  {'名称': 'ubuntu 20.04、22.04 汇总', 'id': '1b5b5aaf-4f1b-806d-9123-f191a141f526'}\n",
      "content:  Ubuntu Oracular 24.1\n",
      "doc-2: \n",
      "metadata:  {'名称': 'conda / pip', 'id': '1b5b5aaf-4f1b-80a4-ae93-d427b057fd26'}\n",
      "content:  我的ip似乎被清华源给封了，pip总是报\n",
      "doc-3: \n",
      "metadata:  {'名称': '测试NotionDBLoader', 'id': '1b5b5aaf-4f1b-8063-89b0-d995708eace9'}\n",
      "content:  from langchain.docum\n"
     ]
    }
   ],
   "source": [
    "i = 0\n",
    "for doc in documents:\n",
    "    i += 1\n",
    "    print(f\"doc-{i}: \")\n",
    "    print(\"metadata: \", doc.metadata)\n",
    "    print(\"content: \", doc.page_content[:20])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "chroma要求元数据值必须是 字符串（str）、数字（int/float）或布尔值（bool）。\n",
    "若元数据值为 None、list 或 dict，Chroma 会直接报错或静默丢弃该字段。\n",
    "上面的元数据的值都符合要求，都是字符串。\n",
    "```json\n",
    "{\n",
    "    '名称': 'ubuntu 20.04、22.04 汇总', \n",
    "    'id': 'xxx'\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'名称': 'ubuntu 20.04、22.04 汇总', 'id': '1b5b5aaf-4f1b-806d-9123-f191a141f526'}\n",
      "{'名称': 'conda / pip', 'id': '1b5b5aaf-4f1b-80a4-ae93-d427b057fd26'}\n",
      "{'名称': '测试NotionDBLoader', 'id': '1b5b5aaf-4f1b-8063-89b0-d995708eace9'}\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "def preprocess_metadata(metadata):\n",
    "    for key, value in metadata.items():\n",
    "        if value is None:\n",
    "            metadata[key] = ''\n",
    "        elif isinstance(value, list):\n",
    "            metadata[key] = ', '.join(value)\n",
    "        elif isinstance(value, dict):\n",
    "            metadata[key] = json.dumps(value)\n",
    "    return metadata\n",
    "\n",
    "for doc in documents:\n",
    "    print(preprocess_metadata(doc.metadata))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    chunk_size=500,  # 块大小\n",
    "    chunk_overlap=50  # 块间重叠\n",
    ")\n",
    "all_splits = text_splitter.split_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "215\n",
      "{'名称': 'ubuntu 20.04、22.04 汇总', 'id': '1b5b5aaf-4f1b-806d-9123-f191a141f526'}\n"
     ]
    }
   ],
   "source": [
    "print(len(all_splits))\n",
    "print(all_splits[0].metadata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 下载 Embedding 模型，从 HuggingFace 拉取模型至本地\n",
    "\n",
    "# ! cd models/sentence_embedding\n",
    "# ! git clone https://huggingface.co/BAAI/bge-m3\n",
    "# ! git clone https://huggingface.co/BAAI/bge-large-zh\n",
    "# ! git clone https://huggingface.co/BAAI/bge-large-zh-v1.5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_huggingface import HuggingFaceEmbeddings\n",
    "\n",
    "model_name = \"../models/sentence_embedding/bge-large-zh-v1.5\"\n",
    "model_kwargs = {'device': 'cpu'}\n",
    "encode_kwargs = {'normalize_embeddings': False}\n",
    "embeddings = HuggingFaceEmbeddings(\n",
    "    model_name=model_name,\n",
    "    model_kwargs=model_kwargs,\n",
    "    encode_kwargs=encode_kwargs\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<langchain_community.vectorstores.chroma.Chroma at 0x7b3142973a60>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.vectorstores import Chroma\n",
    "\n",
    "CHROMA_DB_PERSIST_DIR = os.getenv(\"CHROMA_DB_PERSIST_DIR\")\n",
    "\n",
    "Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory=CHROMA_DB_PERSIST_DIR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_92515/551887959.py:1: LangChainDeprecationWarning: The class `Chroma` was deprecated in LangChain 0.2.9 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-chroma package and should be used instead. To use it run `pip install -U :class:`~langchain-chroma` and import as `from :class:`~langchain_chroma import Chroma``.\n",
      "  vectorstore = Chroma(embedding_function=embeddings, persist_directory=CHROMA_DB_PERSIST_DIR)\n"
     ]
    }
   ],
   "source": [
    "vectorstore = Chroma(embedding_function=embeddings, persist_directory=CHROMA_DB_PERSIST_DIR)\n",
    "retriever = vectorstore.as_retriever(\n",
    "    search_type=\"similarity\",  # mmr\n",
    "    search_kwargs={\"k\": 5},\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.prompts import ChatPromptTemplate\n",
    "\n",
    "template = \"\"\"基于以下上下文回答问题：\n",
    "{context}\n",
    "\n",
    "问题：{question}\n",
    "\"\"\"\n",
    "prompt = ChatPromptTemplate.from_template(template)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_ollama import ChatOllama\n",
    "\n",
    "# 初始化本地Ollama模型\n",
    "llm = ChatOllama(\n",
    "    model=\"qwen2.5:32b\",\n",
    "    temperature=0.1,\n",
    "    base_url=\"http://localhost:11434\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.schema.runnable import RunnablePassthrough\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "\n",
    "# 构建RAG链\n",
    "rag_chain = (\n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()} \n",
    "    | prompt \n",
    "    | llm \n",
    "    | StrOutputParser()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "要复制一个conda环境，可以使用`conda create`命令，并结合`--clone`选项。具体步骤如下：\n",
      "\n",
      "1. 使用以下命令来克隆现有的环境：\n",
      "   ```\n",
      "   conda create -n 新环境名 --clone 旧环境名\n",
      "   ```\n",
      "\n",
      "例如，如果你想要通过克隆名为 `py36` 的环境来创建一个叫做 `py36_bak` 的新环境，你可以运行如下命令：\n",
      "```\n",
      "conda create -n py36_bak --clone py36\n",
      "```\n"
     ]
    }
   ],
   "source": [
    "response = rag_chain.invoke(\"如何复制conda环境？\")\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'context': [Document(metadata={'id': '1b5b5aaf-4f1b-80a4-ae93-d427b057fd26', '名称': 'conda / pip'}, page_content='v2calibration代码环境配置（conda）'), Document(metadata={'id': '1b5b5aaf-4f1b-80a4-ae93-d427b057fd26', '名称': 'conda / pip'}, page_content='下载最新当前（20230520）最新版本的conda：\\n\\tAnaconda3-2023.03-1-Linux-x86_64.sh\\n\\thttps://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2023.07-1-Linux-x86_64.sh\\n\\tbash Anaconda3-2021.11-Linux-x86_64.sh\\n\\t安装位置在：/home/hey/anaconda3\\n\\t自动初始化？yes\\n\\t（如果选择yes，则每次打开新的终端时，都会默认进入到base环境）'), Document(metadata={'id': '1b5b5aaf-4f1b-80a4-ae93-d427b057fd26', '名称': 'conda / pip'}, page_content='conda env list\\n\\t\\tconda info -e \\n\\t\\t$ pip list\\n\\t复制环境：conda create -n <new_env_name> --clone <origin_env_name>\\n\\t\\t通过克隆py36来创建一个称为py36_bak的副本：\\n\\t\\tconda create -n py36_bak --clone py36\\n\\t删除环境：\\n\\t\\tconda env remove -n <env_name>\\n\\t保存环境信息到environment.yaml文件中：conda env export > environment.yaml\\n\\t通过environment.yaml环境文件创建文件： conda env create -f environment.yaml\\n\\t查看已安装的包：conda list\\n\\t搜索包：\\n\\t\\t模糊搜索：conda search <package_name1>\\n\\t\\tconda list | findstr torch\\n\\t\\tconda search --full-name <package name> 搜索指定的包'), Document(metadata={'id': '1b5b5aaf-4f1b-80a4-ae93-d427b057fd26', '名称': 'conda / pip'}, page_content='conda search --full-name <package name> 搜索指定的包\\n\\t\\t\\t比如 conda search --full-name tensorflow 显示所有的tensorflow包。\\n\\t安装包：conda install <package_name1> <package_name2>\\n\\t卸载包：\\n\\t\\tconda remove <package_name>\\n\\t\\t\\tconda remove --name [env_name] --all，回车后出现一系列在这个环境所安装的包；输入【y】进行环境的删除。\\n\\t\\tconda uninstall package_name(包名)\\n\\tcopy环境：\\n\\tconda create\\n\\t --\\n\\tname 新环境名\\n\\t --\\n\\tclone 旧环境名\\n\\t检查更新当前的conda版本\\n\\t\\tconda update conda\\n\\t参考'), Document(metadata={'id': '1b5b5aaf-4f1b-80a4-ae93-d427b057fd26', '名称': 'conda / pip'}, page_content='$ python test.py \\n4.10.0\\n/home/hey/anaconda3/envs/test/lib/python3.8/site-packages/cv2.cpython-38-x86_64-linux-gnu.so\\n# 验证一下，完成任务了\\n\\n\\nubuntu 2204安装conda\\n\\t下载最新当前最新版本的conda\\n\\tconda config --set auto_activate_base false\\nconda init --reverse $SHELL\\nubuntu 20.04 安装 conda')], 'question': '如何复制conda环境？'}\n"
     ]
    }
   ],
   "source": [
    "from langchain.schema.runnable import RunnablePassthrough\n",
    "from langchain_core.runnables import RunnableParallel\n",
    "\n",
    "partial_chain = RunnableParallel( \n",
    "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
    ")\n",
    "response = partial_chain.invoke(\"如何复制conda环境？\")\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```json\n",
    "{\n",
    "    'context': [\n",
    "        Document(metadata={'id': '1b5b5aaf-4f1b-80a4-ae93-d427b057fd26', '名称': 'conda / pip'}, page_content='v2calibration代码环境配置（conda）'), \n",
    "        Document(metadata={'id': '1b5b5aaf-4f1b-80a4-ae93-d427b057fd26', '名称': 'conda / pip'}, page_content='下载最新当前（20230520）最新版本的conda：\\n\\tAnaconda3-2023.03-1-Linux-x86_64.sh\\n\\thttps://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2023.07-1-Linux-x86_64.sh\\n\\tbash Anaconda3-2021.11-Linux-x86_64.sh\\n\\t安装位置在：/home/hey/anaconda3\\n\\t自动初始化？yes\\n\\t（如果选择yes，则每次打开新的终端时，都会默认进入到base环境）'), \n",
    "        Document(metadata={'id': '1b5b5aaf-4f1b-80a4-ae93-d427b057fd26', '名称': 'conda / pip'}, page_content='conda env list\\n\\t\\tconda info -e \\n\\t\\t$ pip list\\n\\t复制环境：conda create -n <new_env_name> --clone <origin_env_name>\\n\\t\\t通过克隆py36来创建一个称为py36_bak的副本：\\n\\t\\tconda create -n py36_bak --clone py36\\n\\t删除环境：\\n\\t\\tconda env remove -n <env_name>\\n\\t保存环境信息到environment.yaml文件中：conda env export > environment.yaml\\n\\t通过environment.yaml环境文件创建文件： conda env create -f environment.yaml\\n\\t查看已安装的包：conda list\\n\\t搜索包：\\n\\t\\t模糊搜索：conda search <package_name1>\\n\\t\\tconda list | findstr torch\\n\\t\\tconda search --full-name <package name> 搜索指定的包'), \n",
    "        Document(metadata={'id': '1b5b5aaf-4f1b-80a4-ae93-d427b057fd26', '名称': 'conda / pip'}, page_content='conda search --full-name <package name> 搜索指定的包\\n\\t\\t\\t比如 conda search --full-name tensorflow 显示所有的tensorflow包。\\n\\t安装包：conda install <package_name1> <package_name2>\\n\\t卸载包：\\n\\t\\tconda remove <package_name>\\n\\t\\t\\tconda remove --name [env_name] --all，回车后出现一系列在这个环境所安装的包；输入【y】进行环境的删除。\\n\\t\\tconda uninstall package_name(包名)\\n\\tcopy环境：\\n\\tconda create\\n\\t --\\n\\tname 新环境名\\n\\t --\\n\\tclone 旧环境名\\n\\t检查更新当前的conda版本\\n\\t\\tconda update conda\\n\\t参考'), \n",
    "        Document(metadata={'id': '1b5b5aaf-4f1b-80a4-ae93-d427b057fd26', '名称': 'conda / pip'}, page_content='$ python test.py \\n4.10.0\\n/home/hey/anaconda3/envs/test/lib/python3.8/site-packages/cv2.cpython-38-x86_64-linux-gnu.so\\n# 验证一下，完成任务了\\n\\n\\nubuntu 2204安装conda\\n\\t下载最新当前最新版本的conda\\n\\tconda config --set auto_activate_base false\\nconda init --reverse $SHELL\\nubuntu 20.04 安装 conda')\n",
    "        ], \n",
    "    'question': '如何复制conda环境？'\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "在使用 Conda 管理 Python 开发环境时，有时需要将一个已有的环境复制为另一个新环境。这可以通过导出当前环境的包列表并基于此列表创建新的环境来实现。以下是具体步骤：\n",
      "\n",
      "1. **导出现有环境配置**：\n",
      "   首先，你需要导出现有环境的所有包及其版本信息到一个文件中。\n",
      "   \n",
      "   ```bash\n",
      "   conda env export --name 现有环境名称 > environment.yml\n",
      "   ```\n",
      "   \n",
      "   这里，“现有环境名称”是你要复制的源环境的名字。`environment.yml` 文件将包含所有已安装包的信息。\n",
      "\n",
      "2. **创建新环境**：\n",
      "   使用导出的文件来创建一个新的 Conda 环境。\n",
      "   \n",
      "   ```bash\n",
      "   conda env create --name 新环境名称 --file environment.yml\n",
      "   ```\n",
      "   \n",
      "   这里，“新环境名称”是你想要给新的复制环境起的名字。\n",
      "\n",
      "3. **验证新环境**：\n",
      "   创建完成后，你可以激活这个新环境并检查其内容是否与原环境一致。\n",
      "   \n",
      "   ```bash\n",
      "   conda activate 新环境名称\n",
      "   conda list\n",
      "   ```\n",
      "\n",
      "4. **可选：手动调整**：\n",
      "   如果需要，你可以在 `environment.yml` 文件中手动修改一些包的版本或添加/删除某些包，然后再创建新环境。\n",
      "\n",
      "通过上述步骤，你可以轻松地复制一个 Conda 环境。这种方法不仅方便了环境管理，也使得在不同机器上保持一致的开发环境变得简单。\n"
     ]
    }
   ],
   "source": [
    "# 无参考资料时，gpt的回复：\n",
    "\n",
    "response = llm.invoke(\"如何复制conda环境？\")\n",
    "print(response.content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "chatbot_temp",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
